[New Scheduler] Add duration checker #4984

style95 · 2020-09-24T23:59:52Z

Description

This is a follow-up PR to #4983. Once #4983 is merged, I will rebase this one.

Major changes are:

This adds a duration checker backed by ElasticSearch.
With the new scheduler, it is important to decide when to add containers and how many to add.
The scheduler calculates the average duration of the most recent N activations and derives the processing power of one container, i.e. how many activations one container can handle in a given time. Using the average duration, the number of incoming activations, and the number of activations already in the queue, the scheduler can add enough containers to handle the load.
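
As a rough illustration of the calculation described above, here is a minimal sketch. It is not the scheduler's actual code: the names `ContainerEstimate`, `capacityPerContainer`, and `containersToAdd`, the 1000 ms default window, and the formula itself are assumptions made for this example.

```scala
object ContainerEstimate {
  // Hypothetical: how many activations one container can finish within `windowMs`,
  // given the average duration of recent activations. Never returns less than 1.
  def capacityPerContainer(avgDurationMs: Double, windowMs: Double): Int =
    if (avgDurationMs <= 0) 1
    else math.max(1, math.floor(windowMs / avgDurationMs).toInt)

  // Hypothetical: containers still needed to drain `incoming + queued` activations
  // within one window, given the containers that already exist.
  def containersToAdd(avgDurationMs: Double,
                      incoming: Int,
                      queued: Int,
                      existing: Int,
                      windowMs: Double = 1000.0): Int = {
    val capacity = capacityPerContainer(avgDurationMs, windowMs)
    val required = math.ceil((incoming + queued).toDouble / capacity).toInt
    math.max(0, required - existing)
  }
}
```

Under these assumptions, an average duration of 100 ms gives each container a capacity of 10 activations per second, so 30 incoming plus 20 queued activations with 2 existing containers would call for 3 more.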

Related issue and scope

I opened an issue to propose and discuss this change ([In-progress] New scheduling components #4922)

My changes affect the following components

Types of changes

Bug fix (generally a non-breaking change which closes an issue).
Enhancement or new feature (adds new functionality).
Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

I signed an Apache CLA.
I reviewed the style guides and followed the recommendations (Travis CI will check :).
I added tests to cover my changes.
My changes require further changes to the documentation.
I updated the documentation where necessary.

style95 · 2020-09-25T00:01:32Z

.../src/main/scala/org/apache/openwhisk/core/scheduler/queue/ElasticSearchDurationChecker.scala

+  }
+}
+
+trait DurationCheckerProvider extends Spi {


It is based on the SPI.

bdoyle0182 · 2020-09-25T01:23:43Z

.../src/main/scala/org/apache/openwhisk/core/scheduler/queue/ElasticSearchDurationChecker.scala

+
+    actionMetaData.binding match {
+      case Some(binding) =>
+        client


The Some and None cases can call a shared helper function, since the only difference in the query is the List to match on. Pass that List as a parameter.

Updated accordingly.
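
The refactoring suggested in this thread can be sketched roughly as follows. This is only an illustration of the pattern: `MatchTerm`, `ActionMeta`, the field names, and the string-based query are hypothetical stand-ins and do not reflect the actual Elasticsearch client calls in the PR.

```scala
object QueryHelperSketch {
  // Hypothetical stand-in for a search-filter term; not a real client type.
  final case class MatchTerm(field: String, value: String)

  // Hypothetical metadata: an action that is optionally bound to a package.
  final case class ActionMeta(path: String, binding: Option[String])

  // Shared helper: the query logic lives here once and receives the terms as a parameter.
  private def buildQuery(terms: List[MatchTerm]): String =
    terms.map(t => s"${t.field}:${t.value}").mkString(" AND ")

  def averageDurationQuery(meta: ActionMeta): String = {
    // Only the list of terms differs between the two cases.
    val terms = meta.binding match {
      case Some(binding) => List(MatchTerm("path", meta.path), MatchTerm("binding", binding))
      case None          => List(MatchTerm("path", meta.path))
    }
    buildQuery(terms)
  }
}
```

Keeping the pattern match limited to building the list means any later change to the query itself only has to be made in one place.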

bdoyle0182 · 2020-09-25T01:27:39Z

...scheduler/src/main/scala/org/apache/openwhisk/core/scheduler/queue/NoopDurationChecker.scala

+
+import scala.concurrent.Future
+
+object NoopDurationCheckerProvider extends DurationCheckerProvider {


So this means you operate the scheduler without using the average-activation-duration heuristic for an action. How important is the heuristic to the scheduler's performance?

This is only for other DBs such as CouchDB or CosmosDB, in case the scheduler is used with something other than ES
(although using it with ES is highly recommended).

Regarding the average duration, it improves the accuracy of the calculation, but the queue can still work without it. When an action is newly created, there are no activations yet and therefore no average duration.
In such a case, the queue assumes one container can handle one activation in the given time.
So even if one container could actually handle multiple activations in that period, it is assumed to handle only one, and the schedulers would tend to overprovision containers.
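
To make the overprovisioning effect concrete, here is a small sketch of the fallback described above. The names `ColdStartFallback`, `activationsPerContainer`, and `containersNeeded` are hypothetical, and the arithmetic is an assumption rather than the scheduler's real logic.

```scala
object ColdStartFallback {
  // Hypothetical: activations one container can handle per window.
  // With no recorded average (a newly created action), fall back to 1.
  def activationsPerContainer(avgDurationMs: Option[Double], windowMs: Double): Int =
    avgDurationMs match {
      case Some(avg) if avg > 0 => math.max(1, (windowMs / avg).toInt)
      case _                    => 1
    }

  // Hypothetical: containers needed to serve `pending` activations within one window.
  def containersNeeded(pending: Int, avgDurationMs: Option[Double], windowMs: Double): Int =
    math.ceil(pending.toDouble / activationsPerContainer(avgDurationMs, windowMs)).toInt
}
```

With 50 pending activations and a 1000 ms window, a known average of 100 ms yields 5 containers in this sketch, while a missing average yields 50.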

In the case of CouchDB or CosmosDB, there is no average activation duration calculation since it uses this, correct?

Yes.
If required, anyone can implement one, as it is based on the SPI.
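
For readers unfamiliar with the pattern, a pluggable checker with a no-op default might look roughly like the sketch below. Only the names `DurationCheckerProvider`, `NoopDurationCheckerProvider`, and `Spi` appear in this PR's diff. The local `Spi` trait, the `DurationChecker` interface, and the method names `averageDurationMs` and `instance` are assumptions so the example compiles on its own; the real signatures may differ.

```scala
import scala.concurrent.Future

object DurationCheckerSpiSketch {
  // Stand-in for OpenWhisk's Spi marker trait so the sketch compiles on its own.
  trait Spi

  // Hypothetical interface: the real trait's methods and signatures may differ.
  trait DurationChecker {
    def averageDurationMs(action: String): Future[Option[Double]]
  }

  trait DurationCheckerProvider extends Spi {
    def instance(): DurationChecker
  }

  // A no-op checker never reports an average, so callers use their default assumption.
  object NoopDurationChecker extends DurationChecker {
    override def averageDurationMs(action: String): Future[Option[Double]] =
      Future.successful(None)
  }

  object NoopDurationCheckerProvider extends DurationCheckerProvider {
    override def instance(): DurationChecker = NoopDurationChecker
  }
}
```

A store-specific implementation for another DB would provide its own `DurationChecker` and expose it through a matching provider object.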

One thing I forgot to tell you: after this duration checker landed, we introduced one more optimization.
Initially, the average duration was always calculated by this module, but now (in our downstream) it is only used when a queue is newly created. After that, the queue uses the durations passed from containers.
As per POEM2, each container autonomously pulls an activation by sending a fetch request. We added one more field, lastDuration, to the fetch request, so the queue can keep the most recent N durations in a circular queue and calculate the average duration without any external API call.

But when a new queue is created or an action is newly created, there is no data in the circular queue, and the duration checker is used in such cases.
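
The in-memory averaging described here could be sketched as a fixed-size ring buffer. This is an assumed illustration, not the downstream implementation: `RecentDurations` and `DurationWindow` are hypothetical names, and only the `lastDuration` field is taken from the comment above.

```scala
object RecentDurations {
  // Hypothetical fixed-size ring buffer holding the last `capacity` durations (ms).
  final class DurationWindow(capacity: Int) {
    require(capacity > 0, "capacity must be positive")
    private val slots = new Array[Long](capacity)
    private var next = 0   // index that the next value overwrites
    private var count = 0  // number of valid entries, at most `capacity`
    private var sum = 0L   // running sum of the valid entries

    // Record the `lastDuration` reported with a fetch request.
    def add(lastDurationMs: Long): Unit = {
      if (count == capacity) sum -= slots(next) else count += 1
      slots(next) = lastDurationMs
      sum += lastDurationMs
      next = (next + 1) % capacity
    }

    // None while the window is empty, which is when an external checker would be consulted.
    def average: Option[Double] =
      if (count == 0) None else Some(sum.toDouble / count)
  }
}
```

Maintaining a running sum keeps both `add` and `average` constant-time, and an empty window returning `None` marks the case where the duration checker would be the fallback.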

Ah okay, cool. So if I understand you correctly, once the invoker is running, we still get to use the average-activation-duration heuristic since we track it in memory. The Elasticsearch SPI is just for startup. That's good to know. The optimization you described sounds more important, and we'll still get the benefits of tracking activation duration.

bdoyle0182 · 2020-09-29T06:51:50Z

LGTM

codecov-io · 2020-11-19T14:53:13Z

Codecov Report

Merging #4984 (307527e) into master (2d0c8a7) will increase coverage by 46.80%.
The diff coverage is 77.27%.

@@             Coverage Diff             @@
##           master    #4984       +/-   ##
===========================================
+ Coverage   29.09%   75.89%   +46.80%     
===========================================
  Files         195      206       +11     
  Lines        9553    10122      +569     
  Branches      413      450       +37     
===========================================
+ Hits         2779     7682     +4903     
+ Misses       6774     2440     -4334

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...isk/core/scheduler/queue/NoopDurationChecker.scala | `0.00% <0.00%> (ø)` | |
| ...scheduler/queue/ElasticSearchDurationChecker.scala | `78.84% <78.84%> (ø)` | |
| .../scala/org/apache/openwhisk/core/WhiskConfig.scala | `95.30% <100.00%> (+8.13%)` | ⬆️ |
| ...e/elasticsearch/ElasticSearchActivationStore.scala | `84.66% <100.00%> (+84.66%)` | ⬆️ |
| ...e/openwhisk/core/containerpool/ContainerPool.scala | `89.51% <0.00%> (-2.02%)` | ⬇️ |
| .../core/monitoring/metrics/PrometheusEventsApi.scala | `90.90% <0.00%> (ø)` | |
| ...nwhisk/core/monitoring/metrics/KamonRecorder.scala | `82.45% <0.00%> (ø)` | |
| ...nwhisk/core/monitoring/metrics/EventConsumer.scala | `89.23% <0.00%> (ø)` | |
| ...rg/apache/openwhisk/core/scheduler/Scheduler.scala | `0.00% <0.00%> (ø)` | |
| ...penwhisk/core/monitoring/metrics/MetricNames.scala | `100.00% <0.00%> (ø)` | |
| ... and 150 more | | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2d0c8a7...307527e.

style95 · 2020-11-21T14:53:22Z

It seems the build sometimes fails even though all tests passed, because it failed to upload the file.

All checks passed.
Compressing logs dir...
tar: Removing leading `/' from member names
.
.
.
Uploading to Box...
Posting result <Response [500]>
{
  "error": "File upload failed."
}

Is this step for uploading the log files?
If so, where can I look into it?

style95 · 2020-11-22T11:26:47Z

It keeps failing.

Uploading to Box...
Posting result <Response [500]>
{
  "error": "File upload failed."
}

style95 · 2020-11-22T12:51:59Z

It seems this endpoint no longer accepts any files.

$ curl -XPOST http://DamCYhF8.mybluemix.net/upload?name=system -H 'Content-Type: application/gzip' -F "file1=@log-upload-test.zip" -v
Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 169.62.254.79...
* TCP_NODELAY set
* Connected to DamCYhF8.mybluemix.net (169.62.254.79) port 80 (#0)
> POST /upload?name=system HTTP/1.1
> Host: DamCYhF8.mybluemix.net
> User-Agent: curl/7.54.0
> Accept: */*
> Content-Length: 538
> Expect: 100-continue
> Content-Type: application/gzip; boundary=------------------------5207d7f4ef7c0247
>
< HTTP/1.1 100 Continue
< X-Note: Gateway Ack
< HTTP/1.1 500 Internal Server Error
< X-Backside-Transport: FAIL FAIL
< Connection: Keep-Alive
< Transfer-Encoding: chunked
< Content-Type: application/json
< Date: Sun, 22 Nov 2020 12:48:34 GMT
< X-Global-Transaction-ID: cb47d0745fba5e1fd6a964c3
* HTTP error before end of send, stop sending
<
{
  "error": "File upload failed."
}

Could this be because the disk space is full?

@dgrove-oss @rabbah
Do you have any idea about this?
I don't know who owns the endpoint.

dgrove-oss · 2020-11-22T20:32:42Z

We failed to upload the logs to Box. This happens because we don't have a way to automatically remove old logs, so the Box folder fills up. However, the upload to Box is optional (it will not cause the Travis CI job to fail). There is a real test failure earlier in the log: https://travis-ci.org/github/apache/openwhisk/jobs/744844186#L7457

rabbah · 2020-11-26T02:18:23Z

I will change this upload to use a different object store. I will give it a try soon.

style95 · 2021-01-18T08:02:32Z

Finally, it has passed all test cases :)

- The scheduler PR apache/openwhisk#4984 introduced changes that required adaption in the setup of the openwhisk environment used for the automated tests.

style95 added the scheduler label Sep 24, 2020

style95 commented Sep 25, 2020

bdoyle0182 reviewed Sep 25, 2020

bdoyle0182 approved these changes Oct 5, 2020

style95 closed this Oct 8, 2020

style95 reopened this Oct 8, 2020

style95 force-pushed the add-duration-checker branch from 851d440 to 8e698ae Compare November 14, 2020 08:19

style95 force-pushed the add-duration-checker branch from 8e698ae to 80fef2f Compare November 20, 2020 09:06

style95 force-pushed the add-duration-checker branch from a4ca2c0 to 0032850 Compare January 15, 2021 05:30

style95 and others added 9 commits January 18, 2021 10:23

Add a duration checker for Elasticsearch.

7269649

Add configurations for the ElasticSearchDurationCheckerTests class

5b274a2

Use a private helper function to execute queries.

4a18953

Add an Ansible variable for the duration checker.

99993b6

Apply scalaFmt

0a2d670

Include test cases for duration checker to system tests.

8209e63

Setup ElasticSearch for system tests.

c8a9ab7

Increase patience config to wait for response longer.

189960b

Add postfixOps

307527e

style95 force-pushed the add-duration-checker branch from 0032850 to 307527e Compare January 18, 2021 01:24

style95 merged commit a6ad9e4 into apache:master Jan 18, 2021

falkzoll mentioned this pull request Jan 19, 2021

Adapt to scheduler changes in openwhisk. ibm-functions/runtime-nodejs#219

Merged

This was referenced Jan 19, 2021

Adapt to scheduler changes in openwhisk. ibm-functions/runtime-python#132

Merged

Adapt to scheduler changes in openwhisk. ibm-functions/runtime-swift#74

Merged
